Abstract

Most work in quantum circuit optimization has been performed in isolation from the results 
of quantum fault-tolerance. We break this trend by presenting a polynomial-time algorithm 
for optimizing quantum circuits while taking the actual implementation of fault-tolerant logical 
gates into consideration. Our algorithm resynthesizes quantum circuits composed of CNOT and 
T gates, the latter being typically the most costly gate in fault-tolerant models, e.g., those based 
on the Steane or surface codes, with the purpose of minimizing both T-depth and T-count. The
tested benchmarks show up to 99% reduction in T-depth and up to 42% reduction in T-count. 

1 Introduction 

Quantum computers have the potential to efficiently solve important computational problems, e.g., 
prime factorization [23] and quantum simulation [17], for which there are no known efficient classical
algorithms. However, even with recent advances in quantum information processing technologies 
[7J El ETJ, the prospects of scalable quantum computing without some systematic way of mitigating 
physical errors and noise are bleak. 

The active and vibrant fields of quantum error correction and fault-tolerance provide such 
tools for constructing scalable quantum computers. By combining physical qubits through the 
use of error correcting codes and providing fault-tolerant logical operations, larger computations 
can be achieved with high fidelity - by concatenating codes, or in topological codes by increasing 
code distance - provided the physical operations achieve a certain threshold fidelity. With recent 
improvements to fault-tolerant thresholds [5, 14, 15], scalable quantum computation is becoming
more and more viable, resulting in a growing need for efficient automated design tools targeting 
fault-tolerant quantum computers. 

Quantum circuit synthesis and optimization is particularly important, given the prevalence of 
the circuit model of quantum computation, but previous work has been largely isolated from the 

'This author is currently with the National Science Foundation, Arlington, VA. 

^This author is also with the Perimeter Institute for Theoretical Physics, Waterloo, ON. 






unique concerns of fault-tolerance. While at the physical level, coupled gates are generally the 
hardest to perform, most of the common quantum error-correcting codes have efficient CNOT 
implementations. Moreover, for fault-tolerant models based on (doubly even, self-dual) CSS codes,
e.g., the popular Steane code, as well as the promising surface codes, the Clifford group can be 
implemented as logical gates with little cost [TU [25] . 

For universal quantum computing, however, at least one non-Clifford group gate is needed, 
which typically requires large ancilla factories to implement fault-tolerantly. Since the non-Clifford 
T gate has known constructions in most of the common error correction schemes, the standard 
universal fault-tolerant gate set is taken to be "Clifford + T". Given the high cost of the fault 
tolerant implementations of the T gate [IJQ3], exceeding the cost of Clifford gates by as much as 
a factor of a hundred or more, it has recently been proposed that efficient circuits should minimize 
the number of T gates, and more specifically the number of T gates that cannot be performed in 
parallel [21 HI EJ [22]. We define these metrics as a circuit's T-count and T-depth, respectively. 
In our work, we targeted the design of algorithms to optimize T-depth, and considered T-count 
optimization as secondary. 

Some recent work has addressed the minimization of T-depth [2, 22], though these
previous results focus on finding small optimal two- and three-qubit circuits [2], and on the existence
of T-depth 1 implementations of some (e.g., up to relative phase) variants of the three-qubit Toffoli
gate [22]. By contrast, we report a scalable automated tool for the optimization of T-depth not
limited to a few qubits or a specific gate. In particular, we present a polynomial-time algorithm for 
optimizing both the T-depth and T-count of quantum circuits composed of {CNOT, T} gates. The 
algorithm also makes automated use of ancillae to optimize T-depth, with the addition of ancillae 
typically decreasing runtime of our software implementation. While most interesting fault-tolerant 
quantum circuits include other gates, certain important classes of circuits, such as arithmetic and 
more generally reversible circuits, contain large sub-circuits over {CNOT, T}. Our experiments 
using the available benchmarks show on average 55% reduction in T-depth and 26% reduction 
in T-count without adding any ancillae. When the use of ancillae is allowed, the average T- 
depth reduction is demonstrated to be as high as 86% (the more ancillae are allowed the more 
parallelization becomes possible). 

The rest of the paper is organized as follows: Section 2 describes the mathematical notation and 
definitions used throughout the paper; Section 3 introduces and characterizes {CNOT, T} circuits; 
Section 4 provides a reduction from T gate parallelization to matroid partitioning, for which there 
exists a polynomial-time algorithm; Section 5 describes and analyzes the full algorithm. In Section 
6 we report our experimental results and benchmarks, and Section 7 concludes the paper. 

2 Preliminaries 

We begin by reviewing some basic facts about quantum and reversible circuits necessary for this 
paper. 

The quantum circuit model, one of the most prominent models of quantum computation [16],
describes the state of a system of $n$ qubits as a vector in a $2^n$-dimensional complex vector
space $\mathcal{H}$, and quantum gates as unitary operators on the vector space - operators $U$ such that
$U^\dagger U = UU^\dagger = I$, where $U^\dagger$ denotes the adjoint of $U$. By convention, we represent vectors in the
standard or computational basis of $\mathcal{H}$ with binary strings of length $n$, and write them in Dirac
notation: $|x\rangle$ where $x \in \{0, 1\}^n$. A state of the system then corresponds naturally to a complex
linear combination of computational basis vectors.

By contrast, in the classical circuit model, the state of a system of $n$ bits is itself represented
as a binary string of length $n$, with classical gates corresponding to operators that map length $n$
binary strings to binary strings of length $m$. More precisely, the length $n$ binary strings
are the vectors of $\mathbb{F}_2^n$, where $\mathbb{F}_2$ is the two-element finite field with addition corresponding to logical
exclusive-OR ($\oplus$) and multiplication corresponding to logical AND ($\wedge$). We then represent classical
gates as operators $f : \mathbb{F}_2^n \to \mathbb{F}_2^m$, and we typically refer to $f$ as a (classical) function. For brevity,
if $m = 1$ we call $f$ Boolean.

Given that unitary operators are linear, invertible operators over $\mathcal{H}$, we see that the subset of
quantum gates that permute the computational basis states are exactly the set of invertible classical
gates - we call such gates reversible. The Toffoli gate

$\mathrm{TOF} : |a\rangle|b\rangle|c\rangle \mapsto |a\rangle|b\rangle|(a \wedge b) \oplus c\rangle$,

is one such example of a reversible classical gate.

While all reversible classical gates are linear as operators over $\mathcal{H}$, they need not be linear as
operators over $\mathbb{F}_2^n$. In particular, we call $f : \mathbb{F}_2^n \to \mathbb{F}_2^m$ linear if $f(x \oplus y) = f(x) \oplus f(y)$. The
controlled-NOT gate

$\mathrm{CNOT} : |a\rangle|b\rangle \mapsto |a\rangle|a \oplus b\rangle$,

is an example of a linear reversible gate. It is a known result that the linear reversible
functions are exactly those that can be computed by a circuit consisting of only CNOT gates [20].
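
The difference between the two notions of linearity can be checked by brute force on small gates. The following Python sketch is only an illustration: the helper names `cnot`, `toffoli`, `xor` and `is_linear` are hypothetical and not part of the paper. It tests the condition f(x ⊕ y) = f(x) ⊕ f(y) over all pairs of bit strings.

```python
from itertools import product

def cnot(bits):
    # CNOT : |a>|b> -> |a>|a xor b>
    a, b = bits
    return (a, a ^ b)

def toffoli(bits):
    # TOF : |a>|b>|c> -> |a>|b>|(a and b) xor c>
    a, b, c = bits
    return (a, b, (a & b) ^ c)

def xor(x, y):
    # Component-wise addition in F_2^n.
    return tuple(p ^ q for p, q in zip(x, y))

def is_linear(f, n):
    # Hypothetical helper: exhaustively test f(x + y) == f(x) + f(y) over F_2^n.
    vecs = list(product((0, 1), repeat=n))
    return all(f(xor(x, y)) == xor(f(x), f(y)) for x in vecs for y in vecs)

def is_reversible(f, n):
    # A classical gate is reversible when it permutes the 2^n bit strings.
    vecs = list(product((0, 1), repeat=n))
    return len({f(x) for x in vecs}) == len(vecs)

print(is_linear(cnot, 2), is_reversible(cnot, 2))
print(is_linear(toffoli, 3), is_reversible(toffoli, 3))
```

Both gates permute the basis states, but only CNOT passes the linearity test; for the Toffoli gate the inputs x = 100 and y = 010 already give a counterexample.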

Throughout this paper, we will be particularly interested in linear Boolean functions and their
relation to linear reversible functions. For convenience, we refer to a linear Boolean function
$f : \mathbb{F}_2^n \to \mathbb{F}_2$ as a row vector over $\mathbb{F}_2^n$ - i.e., $f = x^T$ for some $x \in \mathbb{F}_2^n$. Furthermore, for a set of linear
Boolean functions $S$, we define $\mathrm{rank}(S)$ as the maximum number of independent (row) vectors in
$S$.

Lemma 1. Fix a subspace $V$ of $\mathbb{F}_2^n$. Given a set of linear Boolean functions $S = \{f_1, f_2, \dots, f_n\} \subseteq
\mathbb{F}_2^n \to \mathbb{F}_2$, the linear function $f : V \to W$ defined as

$f : |a_1 a_2 \dots a_n\rangle \mapsto |f_1(a_1, \dots, a_n) \dots f_n(a_1, \dots, a_n)\rangle$

is reversible if and only if $\mathrm{rank}(S) = \dim(V)$.

Proof. Reversibility follows if and only if $\mathrm{rank}(S) = \dim(V)$ due to a straightforward application
of the rank-nullity theorem: $\dim(\mathrm{im}(f)) + \dim(\ker(f)) = \dim(V)$. As the functions in $S$ form the
row vectors of $f$, $\mathrm{rank}(S) = \dim(\mathrm{im}(f))$ and so $f$ is one-to-one if and only if $\mathrm{rank}(S) = \dim(V)$.
Thus $f$ is invertible when restricted to its image, i.e., there exists $f^{-1} : \mathrm{im}(f) \to V$ such that
$f f^{-1} = f^{-1} f = I$. □

While we defined a reversible computation as an invertible operator, in a more informal sense
reversibility only requires that the computation can be "undone." For this reason, we only need
existence of the inverse map $f^{-1} : \mathrm{im}(f) \to V$ as above for the operator $f$ to be reversible.

We call a set $S \subseteq \mathbb{F}_2^n \to \mathbb{F}_2$ (reversibly) computable with respect to subspace $V$ of $\mathbb{F}_2^n$ if there
exists a superset $S'$ of $S$ with cardinality $n$ such that $\mathrm{rank}(S') = \dim(V)$.

3 {CNOT, T} circuits



In this paper, we focus on optimizing a subset of quantum circuits composed of CNOT and
$T := \begin{pmatrix} 1 & 0 \\ 0 & e^{i\pi/4} \end{pmatrix}$ gates, denoted {CNOT, T} circuits. While we work with {CNOT, T} circuits, it should
be noted that the gate set actually considered is $\{\mathrm{CNOT}, T, P := T^2, Z := P^2 = T^4, T^\dagger, P^\dagger, Z^\dagger\}$ -
this distinction is important in realistic computations, as the
P (Phase) and Z (Pauli-Z) gates, along with their inverses, have efficient logical implementations
in most fault tolerant architectures [14, 25].

We first describe the set of unitaries implementable by {CNOT, T} circuits [2]:

Lemma 2. Unitary $U \in U(2^n)$ is exactly implementable by an $n$-qubit circuit over {CNOT, T} if
and only if

$U|a_1 a_2 \dots a_n\rangle = \omega^{t}\,|g(a_1, a_2, \dots, a_n)\rangle$

where $\omega = e^{i\pi/4}$ and $t = \sum_{i=1}^{k} c_i \cdot f_i(a_1, \dots, a_n)$ for some linear Boolean functions $f_1, f_2, \dots, f_k$ with
coefficients $c_1, c_2, \dots, c_k \in \mathbb{Z}_8$ and linear reversible function $g$.

A full proof is contained in [2]. We generalize the lemma by grouping together identical functions
$f_i$ and placing their multiplicity in the coefficients. As a result, we can fully characterize any unitary
$U \in U(2^n)$ implementable by a {CNOT, T} circuit with a set of linear Boolean functions $S \subseteq \mathbb{F}_2^n \to
\mathbb{F}_2$, a coefficient $c_i \in \mathbb{Z}_8$ for each $f_i \in S$, and a (linear, reversible) output function $g : \mathbb{F}_2^n \to \mathbb{F}_2^n$,
with the interpretation $U : |a_1 a_2 \dots a_n\rangle \mapsto \omega^{\sum_{f_i \in S} c_i \cdot f_i(a_1, a_2, \dots, a_n)}\,|g(a_1, a_2, \dots, a_n)\rangle$.

The natural intuition gained from Lemma 2 is that any {CNOT, T} circuit can be written as a
sequence of alternating stages where either CNOT or T gates are applied. Indeed, if some subset
$S' \subseteq S$ is computable with respect to the input subspace, then we can first compute $S'$ with a
CNOT stage, then apply the phase rotation $\omega^{\sum_{f_i \in S'} c_i \cdot f_i}$ with a stage of T gates. As each $f_i \in S'$ is
computed with a different qubit, the phase computations can be performed in parallel with depth
$\max_{f_i \in S'} c_i$, and if P and Z are implemented as logical gates, then this T stage can be written in
T-depth 1 using at most $|S'|$ T gates; for the remainder of this paper, we will use this gate set.
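
To see why a single T stage suffices once P = T^2 and Z = T^4 are available, a coefficient c in Z_8 can be split by its binary digits, c = c_0 + 2 c_1 + 4 c_2, so that the rotation by c is applied as T^{c_0} P^{c_1} Z^{c_2}. The short Python sketch below illustrates this; the function names `phase_gates` and `t_count` are hypothetical, and for simplicity the sketch does not use the inverse gates that the gate set also contains.

```python
def phase_gates(c):
    """Gates realizing the phase omega^(c * f) on a wire holding f.

    Assumption of this sketch: only T, P = T^2 and Z = T^4 are used,
    so at most one T gate is needed per coefficient.
    """
    c %= 8  # coefficients live in Z_8
    gates = []
    if c & 4:
        gates.append("Z")  # contributes T^4
    if c & 2:
        gates.append("P")  # contributes T^2
    if c & 1:
        gates.append("T")  # the only non-Clifford gate
    return gates

def t_count(coeffs):
    # Only odd coefficients require a T gate.
    return sum(c % 2 for c in coeffs)

print(phase_gates(7))
print(t_count([1, 2, 7, 4]))
```

Since the gates on different wires act in parallel, a stage built this way has T-depth at most 1 regardless of the coefficient values.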

As a trivial consequence, any unitary $U$ implementable over {CNOT, T} can be implemented
in T-depth $k$ where $k$ is the minimum number of sets partitioning $S$ into computable subsets.
By Lemma 1 we know a set of linear Boolean functions $\{f_1, f_2, \dots, f_n\}$ is computed by a linear
reversible function on a subspace $V$ of $\mathbb{F}_2^n$ with dimension $m$ if $\mathrm{rank}(\{f_1, f_2, \dots, f_n\}) = m$. In the
case when $n = m$ (e.g., the circuit has no ancilla qubits), we see that $\{f_1, f_2, \dots, f_n\}$ must be
linearly independent, and thus a subset $S' \subseteq S$ is computable by adding an additional $n - |S'|$
functions if and only if $S'$ is linearly independent. As a result, the T-depth is given by a minimal
partitioning of $S$ into linearly independent sets.

However, in cases when $n > m$ (e.g., the circuit contains $n - m$ ancilla qubits) we have more
freedom in choosing computable subsets of functions. For any set $S = \{f_1, f_2, \dots, f_k\}$, we informally
note that $S$ is computable if and only if we can add linear Boolean functions $f_{k+1}, \dots, f_n$ to $S$ so
that $\mathrm{rank}(\{f_1, f_2, \dots, f_n\}) = m$. We formalize this condition in the following lemma:

Lemma 3. Given a subspace $V$ of $\mathbb{F}_2^n$ with dimension $m$ and set of linear Boolean functions
$S \subseteq \mathbb{F}_2^n \to \mathbb{F}_2$, there exists a superset $S'$ of $S$ with cardinality $n$ such that $\mathrm{rank}(S') = m$ if and only
if

$m - \mathrm{rank}(S) \le n - |S|$. (1)

Proof. The proof of Lemma 3 follows from basic linear algebra.

Suppose there exists a superset $S'$ of $S$ with cardinality $n$ and $\mathrm{rank}(S') = m$. Then $S' \setminus S$
contains at least $m - \mathrm{rank}(S)$ linearly independent linear Boolean functions, and so $|S' \setminus S| =
n - |S| \ge m - \mathrm{rank}(S)$.

Suppose instead that $m - \mathrm{rank}(S) \le n - |S|$. Then any maximal linearly independent subset of $S$
can be extended to a basis of $V$ by adding $m - \mathrm{rank}(S)$ linear Boolean functions (i.e., unit row vectors
in $\mathbb{F}_2^n$). As a result there exists a rank $m$ superset $S'$ of $S$ with cardinality $|S| + m - \mathrm{rank}(S) \le n$. □

It can be seen that inequality (1) implies $|S| = \mathrm{rank}(S)$, i.e., $S$ is linearly independent, when
$n = m$.
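
The condition of Lemma 3 is straightforward to test numerically. The Python sketch below is an illustration under stated assumptions rather than the authors' implementation: linear Boolean functions are encoded as integer bitmasks (bit i set means input a_i appears in the XOR), and the names `gf2_rank` and `is_computable` are hypothetical.

```python
def gf2_rank(rows):
    """Rank over F_2 of row vectors encoded as integer bitmasks."""
    pivots = {}  # leading-bit position -> reduced row with that leading bit
    for r in rows:
        while r:
            h = r.bit_length() - 1
            if h not in pivots:
                pivots[h] = r
                break
            r ^= pivots[h]  # cancel the leading bit and keep reducing
    return len(pivots)

def is_computable(S, n, m):
    """Inequality (1): m - rank(S) <= n - |S|.

    n is the number of qubits and m the dimension of the input subspace
    (assumption: the functions in S are given restricted to that subspace).
    """
    S = list(S)
    return m - gf2_rank(S) <= n - len(S)

# a0, a1 and a0 xor a1 are linearly dependent (rank 2).
S = [0b001, 0b010, 0b011]
print(is_computable(S, n=3, m=3))  # no ancilla: needs linear independence
print(is_computable(S, n=4, m=3))  # one ancilla gives one unit of slack
```

With n = m the test reduces to linear independence, while each additional ancilla lets a computable set contain one more dependent function.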

4 Minimal partitioning 

We now turn our attention to the problem of determining a minimal partition of a set of linear 
Boolean functions into computable sets. To do so, we first introduce the concept of a matroid, an 
algebraic structure that generalizes the idea of linear independence in vector spaces. 

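Definition 1. A finite matroid is a pair $(S, \mathcal{I})$ where $S$ is a finite set and $\mathcal{I}$ is a set of subsets of $S$
such that

1. $\emptyset \in \mathcal{I}$.

2. For all $A, B \subseteq S$, if $A \in \mathcal{I}$ and $B \subseteq A$, then $B \in \mathcal{I}$.

3. For all $A, B \in \mathcal{I}$, if $|A| > |B|$, then there exists some $a \in A$ such that $B \cup \{a\} \in \mathcal{I}$.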

The matroid partitioning problem is defined as follows: given a matroid $(S, \mathcal{I})$, find a partition
$\{S_1, S_2, \dots, S_k\}$ of $S$ such that $S_i \in \mathcal{I}$ for each $1 \le i \le k$, minimizing $k$. As a well known result,
matroid partitioning can be solved in polynomial time, given an independence oracle for the matroid
[12]. Naturally, we would like to reduce partitioning sets of linear Boolean functions to matroid
partitioning; to do so, we show that the independence condition given as inequality (1), together
with the set of linear Boolean functions, forms a matroid:

Lemma 4. For any subspace $V$ of $\mathbb{F}_2^n$ with dimension $m$ and set of linear Boolean functions
$S = \{f_1, f_2, \dots, f_k\} \subseteq V \to \mathbb{F}_2$, let $\mathcal{I}$ denote the set

$\{A \subseteq S \mid m - \mathrm{rank}(A) \le n - |A|\}$.

The pair $(S, \mathcal{I})$ is a finite matroid.

Proof. We verify that $(S, \mathcal{I})$ satisfies all three conditions of Definition 1.

1. $m - \mathrm{rank}(\emptyset) \le n - |\emptyset|$ is trivially true since $m \le n$. Thus $\emptyset \in \mathcal{I}$.

2. Suppose $A, B \subseteq S$, where $A \in \mathcal{I}$ and $B \subseteq A$. Clearly $\mathrm{rank}(A) - \mathrm{rank}(B) \le |A| - |B|$. Since
$m - \mathrm{rank}(A) \le n - |A|$ we see that

$m \le n + \mathrm{rank}(A) - |A| \le n + \mathrm{rank}(B) - |B|$

and thus $m - \mathrm{rank}(B) \le n - |B|$.

3. Suppose $A, B \in \mathcal{I}$ and $|A| > |B|$. If $\mathrm{rank}(A) \le \mathrm{rank}(B)$, then

$m - \mathrm{rank}(B) \le m - \mathrm{rank}(A) \le n - |A| < n - |B|$,

so $m - \mathrm{rank}(B \cup \{a\}) \le m - \mathrm{rank}(B) \le n - |B| - 1$ and $B \cup \{a\} \in \mathcal{I}$ for any $a \in A \setminus B$.
Otherwise, let $A'$ and $B'$ be maximal linearly independent subsets of $A$ and $B$, respectively.
As a result from linear algebra and the fact that $|A'| > |B'|$, there exists an element $s \in A'$
such that $B' \cup \{s\}$ is linearly independent. Then

$m - \mathrm{rank}(B \cup \{s\}) = m - \mathrm{rank}(B) - 1$
$\le n - |B| - 1$
$= n - |B \cup \{s\}|$.

□
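
As a sanity check on Lemma 4, the three matroid axioms can be verified exhaustively on a small instance. The Python sketch below assumes the bitmask encoding of linear Boolean functions (bit i set means input a_i appears in the XOR); the names `gf2_rank`, `independent` and `check_matroid` are hypothetical, and the check is exponential in |S|, so it is only meant for toy examples.

```python
from itertools import combinations

def gf2_rank(rows):
    """Rank over F_2 of row vectors encoded as integer bitmasks."""
    pivots = {}
    for r in rows:
        while r:
            h = r.bit_length() - 1
            if h not in pivots:
                pivots[h] = r
                break
            r ^= pivots[h]
    return len(pivots)

def independent(A, n, m):
    # Independence condition of Lemma 4: m - rank(A) <= n - |A|.
    return m - gf2_rank(A) <= n - len(A)

def check_matroid(S, n, m):
    """Brute-force check of the three conditions of Definition 1."""
    subsets = [frozenset(c) for k in range(len(S) + 1)
               for c in combinations(S, k)]
    fam = {A for A in subsets if independent(A, n, m)}
    has_empty = frozenset() in fam
    hereditary = all(B in fam for A in fam for B in subsets if B <= A)
    exchange = all(any(B | {a} in fam for a in A - B)
                   for A in fam for B in fam if len(A) > len(B))
    return has_empty and hereditary and exchange

# All seven nonzero linear functions on three inputs.
S = list(range(1, 8))
print(check_matroid(S, n=3, m=3))  # no ancilla
print(check_matroid(S, n=4, m=3))  # one ancilla
```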

As a consequence of Lemmas 3 and 4 we see that the problem of finding a minimal partition
of a set of linear Boolean functions into computable sets is reducible to the matroid partitioning
problem, and thus can be solved in polynomial time given an independence oracle. We additionally
note that the condition in Lemma 3 can be checked for a given subset $S$ using forward elimination
to compute the matrix rank. As arithmetic is performed over the finite field $\mathbb{F}_2$ there can be no
exponential blowup in the number of bits, so forward elimination has time complexity $O(n^3)$.

We now describe an algorithm for matroid partitioning based on an algorithm due to Edmonds
[12]. The basis of the algorithm is the construction of a directed graph $G_s$ given a matroid element
$s$ and the current matroid partition $P$. In particular, the vertices of $G_s$ correspond to elements of
the matroid that are already partitioned, as well as distinguished vertices $\bot_p$ for each $p \in P$. There
exists a directed edge $u \to v$ between two nodes if for $p' \in P$ such that $u \in p'$, $(p' \setminus \{u\}) \cup \{v\}$ is
an independent set, and a directed edge $\bot_p \to u$ if $p \cup \{u\}$ is an independent set. It can then be
shown that $P$ can be updated to include $s$ if and only if there exists a path from some vertex
$\bot_p$ to $s$ in $G_s$ [12]. If there does exist a path, the current partition $P$ is updated according to the
shortest such path.

Rather than generating the graph $G_s$ explicitly for each $s$, we perform a breadth-first search
from $s$ on the implicit graph (Algorithm 1). It is well known that the time complexity of
breadth-first search is $O(|E| + |V|)$ for a graph with edge set $E$ and vertices $V$ - as we explicitly test a
set inclusion in $\mathcal{I}$ by computing matrix rank for each possible edge in graph $G_s$, the breadth-first
search requires time in $O(|V|^2 \cdot n^3 + |V|) = O((2i)^2 \cdot n^3 + 2i)$ where $i$ is the number of elements of
$S$ already partitioned. Finally, by noting that $\sum_{i=1}^{|S|} i^2 = \frac{|S|^3}{3} + \frac{|S|^2}{2} + \frac{|S|}{6}$, we see that Algorithm 1
has time complexity $O(|S|^3 \cdot n^3)$.

Algorithm 1 Matroid partitioning algorithm
function PARTITION($S$, $\mathcal{I}$)
    /* $\mathcal{I}$ denotes the independence oracle */
    Set $P := \emptyset$
    for each $s \in S$ do
        Create path queue $Q$, $Q$.enqueue($s \to \emptyset$)
        Unmark each element of $S$, mark $s$
        while $Q$ non-empty do
            $t := Q$.dequeue()
            for each $p \in P$ do
                Set $p' := p \cup \{\mathrm{head}(t)\}$
                if $p' \in \mathcal{I}$ then
                    Set $p := p'$
                    for each $u \to v$ in path $t$ do
                        Replace $u$ with $v$ in its current partition
                    end for
                else
                    for each unmarked $q \in p$ do
                        if $p' \setminus \{q\} \in \mathcal{I}$ then
                            $Q$.enqueue($q \to t$)
                            Mark $q$
                        end if
                    end for
                end if
            end for
        end while
        If no path was found, set $P := P \cup \{\{s\}\}$
    end for
    return $P$
end function
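
The augmenting-path idea behind Algorithm 1 can be sketched in Python as follows. This is a hedged illustration, not the authors' C++ code: the names `gf2_rank`, `make_oracle` and `partition` are hypothetical, functions are assumed to be distinct nonzero integer bitmasks, every singleton is assumed independent, and the search skips the block that already contains the element being relocated (an assumption of this sketch that the pseudocode leaves implicit).

```python
from collections import deque

def gf2_rank(rows):
    """Rank over F_2 of row vectors encoded as integer bitmasks."""
    pivots = {}
    for r in rows:
        while r:
            h = r.bit_length() - 1
            if h not in pivots:
                pivots[h] = r
                break
            r ^= pivots[h]
    return len(pivots)

def make_oracle(n, m):
    # Independence test from inequality (1): m - rank(A) <= n - |A|.
    return lambda A: m - gf2_rank(A) <= n - len(A)

def partition(elements, independent):
    """Partition distinct elements into independent blocks (sketch)."""
    blocks, home = [], {}          # home: element -> index of its block
    for s in elements:
        parent = {s: None}         # BFS tree; parent[q] displaces q
        queue = deque([s])
        found = None
        while queue and found is None:
            x = queue.popleft()
            for bi, block in enumerate(blocks):
                if home.get(x) == bi:
                    continue       # x cannot move into its own block
                if independent(block + [x]):
                    found = (x, bi)
                    break
                for q in block:
                    rest = [y for y in block if y != q]
                    if q not in parent and independent(rest + [x]):
                        parent[q] = x
                        queue.append(q)
        if found is None:          # no path: open a new block for s
            home[s] = len(blocks)
            blocks.append([s])
            continue
        x, bi = found              # apply the exchanges along the path
        while x is not None:
            old = home.get(x)
            if old is not None:
                blocks[old].remove(x)
            blocks[bi].append(x)
            home[x] = bi
            x, bi = parent[x], old
    return blocks

S = list(range(1, 8))              # all nonzero functions on three inputs
print(len(partition(S, make_oracle(3, 3))))
print(len(partition(S, make_oracle(6, 3))))
```

On this toy instance the seven functions need three blocks without ancillae, since each block holds at most three independent functions, and two blocks when three ancillae are available.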



5 Automated T parallelization 

We now describe our main result, an algorithm for re-synthesis of {CNOT, T} circuits minimizing 
T-depth and T-count. We note that the input space of a circuit with $n$ data qubits and $m$ ancillae
is a dimension $n$ subspace $V$ of $\mathbb{F}_2^{n+m}$.

Given a {CNOT, T} circuit $C$ with $n$ data qubits and $m$ ancillae we first parse the circuit
to characterize the unitary computed by $C$ - i.e., build a set $S$ of linear Boolean functions $f_i$ and
coefficients $c_i$ as they appear in the exponent of $\omega$, as well as the output linear reversible function $g$.
This step can be performed in time $O(|C| \cdot (n + m))$, where $|C|$ denotes the number of gates in $C$, by
storing each wire's state as a linear Boolean function, then applying $\mathrm{CNOT} : |f_1\rangle|f_2\rangle \mapsto |f_1\rangle|f_1 \oplus f_2\rangle$
with a bitwise exclusive-OR, and $T : |f\rangle \mapsto \omega^{f}|f\rangle$ by updating the set $S$ and relevant coefficient.
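
The parsing step can be sketched in a few lines of Python. The gate encoding (`("CNOT", control, target)` and `("T", qubit)` tuples), the bitmask representation of wire states, and the name `parse_cnot_t` are all assumptions made for illustration; the sketch also treats every wire as a data qubit.

```python
def parse_cnot_t(gates, num_qubits):
    """Return (coeffs, wires) for a {CNOT, T} circuit.

    coeffs maps each linear Boolean function (bitmask over the inputs)
    to its coefficient in Z_8; wires[i] is the function output on wire i,
    i.e. the rows of the linear reversible function g.
    """
    wires = [1 << i for i in range(num_qubits)]  # wire i starts as a_i
    coeffs = {}
    for gate in gates:
        if gate[0] == "CNOT":
            _, control, target = gate
            wires[target] ^= wires[control]      # |f1>|f2> -> |f1>|f1 xor f2>
        elif gate[0] == "T":
            f = wires[gate[1]]
            if f:                                # the zero function adds no phase
                coeffs[f] = (coeffs.get(f, 0) + 1) % 8
        else:
            raise ValueError(f"unsupported gate: {gate!r}")
    # Drop functions whose coefficients cancelled modulo 8.
    coeffs = {f: c for f, c in coeffs.items() if c}
    return coeffs, wires

circuit = [("CNOT", 0, 1), ("T", 1), ("CNOT", 0, 1), ("T", 0), ("T", 0)]
coeffs, wires = parse_cnot_t(circuit, 2)
print(coeffs)  # phase exponent t = (a0 xor a1) + 2 * a0
print(wires)   # g is the identity on two qubits
```

Each gate costs one word-level XOR or one dictionary update in this encoding, which is consistent with the linear dependence on the number of gates stated above.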

After parsing the circuit, we generate a partition $P$ of $S$ using Algorithm 1 with the independence
relation in which $A \subseteq S$ belongs to $\mathcal{I}$ if and only if $n - \mathrm{rank}(A) \le (n + m) - |A|$; this step takes time $O(|S|^3 \cdot (n + m)^3)$
as per Section 4. Assuming the target fault tolerant architecture admits logical Phase and Pauli-Z
gates, the T-depth of the resulting circuit is given by $|P|$.

Once we have a partition $P$, we reconstruct a circuit by taking each set $p \in P$, synthesizing
a CNOT circuit computing $p$, applying the necessary phase rotations, then uncomputing $p$. In
order to synthesize such a CNOT circuit, we first add $(n + m) - |p|$ linear Boolean functions so
that the resulting set $p'$ has rank $n$ - this can be accomplished by performing forward elimination,
then adding linear Boolean functions corresponding to missing pivots. Then we synthesize a circuit
computing $p'^{-1}$ by performing Gaussian elimination on the row vectors of $p'$ in $O((n + m)^3)$ time, the
operations of which correspond to the application of CNOT gates; reversing the steps additionally
provides a circuit for computing $p'$. As $|P| \le |S|$ we see that this step has time complexity
$O(|S| \cdot (n + m)^3)$. The entire algorithm, shown in Algorithm 2, thus runs in time $O(|C| \cdot (n + m) +
|S|^3 \cdot (n + m)^3)$.
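
The Gaussian-elimination synthesis can be illustrated with the Python sketch below. It covers only the square, full-rank case without ancillae, which is a simplifying assumption; the names `synthesize_cnot` and `apply_cnots` are hypothetical, and rows are bitmasks with bit i standing for input a_i. Each row operation `row[t] ^= row[c]` corresponds to a CNOT with control c and target t.

```python
def synthesize_cnot(rows):
    """Return CNOTs [(control, target), ...] taking the identity to `rows`.

    Assumes `rows` is an invertible n x n matrix over F_2 (one bitmask per
    wire). Gauss-Jordan elimination reduces it to the identity using only
    row additions; reversing the recorded steps computes `rows`.
    """
    a, n, ops = list(rows), len(rows), []
    for col in range(n):
        if not (a[col] >> col) & 1:
            # Fix a missing pivot by adding a lower row instead of swapping.
            src = next((r for r in range(col + 1, n) if (a[r] >> col) & 1), None)
            if src is None:
                raise ValueError("rows are not linearly independent")
            a[col] ^= a[src]
            ops.append((src, col))
        for r in range(n):
            if r != col and (a[r] >> col) & 1:
                a[r] ^= a[col]
                ops.append((col, r))
    return ops[::-1]  # each CNOT is self-inverse, so reversing inverts

def apply_cnots(ops, n):
    # Simulate the circuit on wire states a_0, ..., a_{n-1}.
    wires = [1 << i for i in range(n)]
    for control, target in ops:
        wires[target] ^= wires[control]
    return wires

target = [0b011, 0b110, 0b100]   # a0 xor a1, a1 xor a2, a2
ops = synthesize_cnot(target)
print(ops)
print(apply_cnots(ops, 3) == target)
```

This elimination uses a number of CNOT gates quadratic in the number of wires and makes no attempt to minimize gate count or depth.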

Algorithm 2 T-parallelization algorithm
function Tpar({CNOT, T}-circuit $C$)
    /* $C$ contains $n$ data qubits, $m$ ancillae */
    Circuit = $\emptyset$
    Compute $t$, $g$ s.t.
        $C : |a_1 a_2 \dots a_{n+m}\rangle \mapsto \omega^{t}\,|g(a_1, a_2, \dots, a_{n+m})\rangle$,
        $t = \sum_{i=1}^{k} c_i \cdot f_i(a_1, a_2, \dots, a_{n+m})$
    Set $S := \{f_1, f_2, \dots, f_k\}$
    $\mathcal{I} := \{A \subseteq S \mid n - \mathrm{rank}(A) \le (n + m) - |A|\}$
    Set $P :=$ Partition($S$, $\mathcal{I}$)
    for each $p \in P$ do
        Compute $p' \supseteq p$ s.t. $\mathrm{rank}(p') = n$, $|p'| = n + m$
        Synthesize CNOT circuit $C_1$ computing $p'$
        Synthesize T circuit $C_2$ for applying the phase rotation $\omega^{\sum_{f_i \in p} c_i \cdot f_i}$
        Append $C_1 C_2 C_1^{-1}$ to Circuit
    end for
    Synthesize CNOT circuit computing $g$ and append to Circuit
    return Circuit
end function

It should be noted that our Gaussian elimination-based implementation produces circuits that
are non-optimal in terms of the number of gates or depth. While our focus was on the
parallelization of T gates, there exist algorithms [18, 20] that produce more efficient circuits for linear
reversible functions. Specifically, [20] provides an algorithm to synthesize linear reversible circuits
with $O(n^2/\log n)$ gates, and [18] reports an $O(n)$-depth algorithm.

6 Experimental results 

We implemented Algorithm 2 in C++ and applied it to optimize various quantum, specifically
arithmetic, circuits from the literature. Individual circuits were written in the standard
fault-tolerant universal gate set $\{H = \frac{1}{\sqrt{2}}\begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}, \mathrm{CNOT}, T\}$, using the decompositions found in [2]
where applicable. In particular, as most arithmetic circuits are dominated by Toffoli gates, we
used a T-depth 3 implementation of the Toffoli gate (Figure 1). Since H gates are present in most
fault-tolerant quantum circuits, we applied our algorithm by grouping together large {CNOT, T}
sub-circuits and optimizing them individually.

Figure 1: T-depth 3 decomposition [2] of the Toffoli gate with the target on qubit 3.

Results are reported in Tables 1 and 2. They were generated on Debian Linux running on a
quad-core 64-bit Intel Core i5 2.80 GHz processor with 16 GB RAM. T-depth for each circuit before
optimization was computed by maximally parallelizing the Toffoli gates, then writing each group
of parallel Toffoli gates in T-depth 3.

Table 1: Benchmark circuits. n specifies the number of qubits in both the original and optimized
circuit. T-count and T-depth give the T-count and T-depth of the original circuit, while T-count'
and T-depth' give the T-count and T-depth of the optimized circuit. Two further columns report the Z and
P counts of the optimized circuit; all unoptimized circuits have Z and P counts of zero. Other gates are not listed as we
focus strictly on T gate complexity. Reductions in T-count ranged from 0-42.2% with an average
of 26.4%, while T-depth reductions ranged from 0-86.6% with an average of 55.4%. The Mod $5_4$
and GF($2^m$) circuits can be found at http://webhome.cs.uvic.ca/~dmaslov/.

The benchmarks (Tables 1 and 2) show that the performance is strongly affected by the structure
of the original circuit. In particular, the algorithm is most effective on circuits where adjacent Toffoli gates
share either controls or targets, in which case the {CNOT, T} sub-circuits within the Toffolis can
be combined. Each of the GF($2^m$) multipliers shares this structure. In fact, our algorithm can
parallelize any GF($2^m$) multiplication circuit to T-depth 2 using sufficient ancillae, by noting that
each such circuit can be written in two {CNOT, T} stages. Those circuits that mix controls and
targets between adjacent Toffoli gates are less affected by the optimization, e.g., CSUM-MUX$_9$, as
the {CNOT, T} sub-circuits are separated by H gates.

Table 2: Benchmarks using extra ancillae. "T-depth ancilla" denotes the optimized T-depth
as reported in Table 1. In cases where the circuit reaches minimum T-depth using at most n
ancillae, optimization results with n ancillae are identical and thus not reported. Using n ancillae,
T-depth reductions belong to the range 66.7-93.9%, averaging 82.0%, and unbounded ancilla use
gave reductions in the range 66.7-99.7%, with an average of 86.7%.

We also tested our algorithm's ability to make use of ancillae to optimize T-depth (Table 2).
For each of the benchmark circuits, we applied our algorithm with an added n ancillae, where 
the original circuit contained n qubits. We also report the minimum T-depth achievable for each 
circuit using our algorithm, along with the minimum number of ancillae required. It can be noted 
that our algorithm usually decreases in running time when ancillae are added, due to a smaller 
minimum number of partitions in each matroid. As a result, the algorithm is very flexible and 
capable of exploiting ancillae to reduce T-depth. Furthermore, our experimental data illustrates a 
great potential for space-time trade-off in quantum circuits. 

7 Conclusion 

We have described a polynomial-time algorithm for the automated optimization of T-depth and
T-count in circuits composed of {CNOT, T} gates. Given a specific formula $t = \sum_{i=1}^{k} c_i \cdot f_i(a_1, a_2, \dots, a_n)$
for the phase exponent in a {CNOT, T} circuit, the algorithm computes the minimal T-depth and
T-count required to apply each phase rotation $\omega^{c_i \cdot f_i(a_1, a_2, \dots, a_n)}$. The algorithm can be applied to
sub-circuits of more general quantum circuits and provides significant reductions in both T-depth
and T-count. For practical uses, the algorithm should be combined with an efficient synthesis
algorithm for linear reversible functions to reduce the number of CNOT stages/gates. Additionally, as
illustrated by Table 2, the algorithm provides substantial flexibility in the trade-off between ancilla
usage and T-depth.